Predicting Quality of Portuguese Vinho Verde White Wine

Group 009E05

Mason Feng

The University of Sydney

Model Selection

In order to create our model, we must first choose our predictor variables. We will compare two methods:

  • Stepwise Selection
  • Lasso Regression

We will compare these models using \(RMSE\), \(MAE\), \(R^2\), \(AIC\) and \(BIC\). From our EDA we noticed that both “alcohol” and “residual.sugar” are highly correlated with “density”. This means we need to be wary of potential multicollinearity, and we may want to remove “density” from our model.

What is multicollinearity?

Multicollinearity occurs when our predictors are correlated with one another, which inflates the standard errors of the coefficient estimates and makes them unstable. We want to reduce this in our model, which we will do by checking the Variance Inflation Factor (VIF) of each variable.

Stepwise Selection

For this method we will compare both forward and backward selection. We know from lectures that this uses AIC as the performance metric. Let's perform stepwise selection and view our performance metrics:

M0 = lm(data$quality ~ 1, data = data)  # Null model
M1 = lm(data$quality ~ ., data = data)  # Full model
step.fwd.aic = step(M0, scope = list(lower = M0, upper = M1),
                    direction = "forward", trace = FALSE)
summary(step.fwd.aic)

Call:
lm(formula = data$quality ~ alcohol + volatile.acidity + residual.sugar + 
    free.sulfur.dioxide + density + pH + sulphates + fixed.acidity, 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8246 -0.4938 -0.0396  0.4660  3.1208 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          1.541e+02  1.810e+01   8.514  < 2e-16 ***
alcohol              1.932e-01  2.408e-02   8.021 1.31e-15 ***
volatile.acidity    -1.888e+00  1.095e-01 -17.242  < 2e-16 ***
residual.sugar       8.285e-02  7.287e-03  11.370  < 2e-16 ***
free.sulfur.dioxide  3.349e-03  6.766e-04   4.950 7.67e-07 ***
density             -1.543e+02  1.834e+01  -8.411  < 2e-16 ***
pH                   6.942e-01  1.034e-01   6.717 2.07e-11 ***
sulphates            6.285e-01  9.997e-02   6.287 3.52e-10 ***
fixed.acidity        6.810e-02  2.043e-02   3.333 0.000864 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7512 on 4889 degrees of freedom
Multiple R-squared:  0.2818,    Adjusted R-squared:  0.2806 
F-statistic: 239.7 on 8 and 4889 DF,  p-value: < 2.2e-16
step.back.aic = step(M1, 
                    direction = "backward", trace = FALSE)
summary(step.back.aic)

Call:
lm(formula = data$quality ~ fixed.acidity + volatile.acidity + 
    residual.sugar + free.sulfur.dioxide + density + pH + sulphates + 
    alcohol, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8246 -0.4938 -0.0396  0.4660  3.1208 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          1.541e+02  1.810e+01   8.514  < 2e-16 ***
fixed.acidity        6.810e-02  2.043e-02   3.333 0.000864 ***
volatile.acidity    -1.888e+00  1.095e-01 -17.242  < 2e-16 ***
residual.sugar       8.285e-02  7.287e-03  11.370  < 2e-16 ***
free.sulfur.dioxide  3.349e-03  6.766e-04   4.950 7.67e-07 ***
density             -1.543e+02  1.834e+01  -8.411  < 2e-16 ***
pH                   6.942e-01  1.034e-01   6.717 2.07e-11 ***
sulphates            6.285e-01  9.997e-02   6.287 3.52e-10 ***
alcohol              1.932e-01  2.408e-02   8.021 1.31e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7512 on 4889 degrees of freedom
Multiple R-squared:  0.2818,    Adjusted R-squared:  0.2806 
F-statistic: 239.7 on 8 and 4889 DF,  p-value: < 2.2e-16
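As a quick sanity check, the AIC and BIC of the selected stepwise model can also be read directly from the fitted lm object with base R (a minimal sketch using the step.fwd.aic object fitted above):

```r
# Base R information criteria for the selected stepwise model
AIC(step.fwd.aic)  # ~11108.29, as reported in our metrics table
BIC(step.fwd.aic)  # ~11173.25, as reported in our metrics table
```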

Stepwise Performance Metrics

cbind(evaluate(y, step.fwd.aic$fitted.values, data), broom::glance(step.fwd.aic)[8:9])
\[\begin{array}{c|ccccc} & \textrm{RMSE} & \textrm{MAE} & R^2 & \textrm{AIC}& \textrm{BIC}\\ \hline \textrm{Stepwise Select} & 0.751 & 0.584 & 0.282 & 11108.29 & 11173.25\\ \end{array}\]
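The evaluate() function used above is a helper defined earlier in our report. A minimal version consistent with the metrics shown (RMSE, MAE and \(R^2\)) might look like the following sketch; the exact signature is an assumption:

```r
# Hypothetical sketch of the evaluate() helper: returns RMSE, MAE and R^2
evaluate = function(y, y_hat, data) {
  resid = y - y_hat
  data.frame(
    RMSE = sqrt(mean(resid^2)),
    MAE  = mean(abs(resid)),
    R2   = 1 - sum(resid^2) / sum((y - mean(y))^2)
  )
}
```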

We notice from the previous slide that forward and backward selection choose the same model, which results in identical performance metrics. However, we haven’t yet dealt with the issue of multicollinearity. We will now check using VIF.

Dealing with multicollinearity

VIF measures multicollinearity in the model, where higher values signify stronger correlation with the other predictors (which we want to avoid!). For the \(i\)-th predictor, the formula is:

\[VIF_i = \frac{1}{1-R^2_i}\]

where \(R^2_i\) is the \(R^2\) obtained by regressing the \(i\)-th predictor on all the other predictors.

Generally we want to ensure that every variable has a \(VIF\) that is \(<5\). We use the vif() function from the car library on our model object:

vif(step.fwd.aic)
            alcohol    volatile.acidity      residual.sugar free.sulfur.dioxide 
           7.622843            1.057310           11.854253            1.149027 
            density                  pH           sulphates       fixed.acidity 
          26.123154            2.113597            1.129688            2.579640 
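To see where these numbers come from, the VIF of a single predictor such as “density” can be computed by hand from the formula above — regress it on the model's other predictors and plug the resulting \(R^2\) in (a sketch, assuming the predictors are columns of data):

```r
# VIF of density = 1 / (1 - R^2) from regressing density
# on the other predictors in the stepwise model
aux = lm(density ~ alcohol + volatile.acidity + residual.sugar +
           free.sulfur.dioxide + pH + sulphates + fixed.acidity,
         data = data)
vif_density = 1 / (1 - summary(aux)$r.squared)
vif_density  # should reproduce the ~26.1 reported by vif() above
```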

We observe that three variables have \(VIF>5\): “alcohol”, “residual.sugar” and “density”, as expected from our EDA. To reduce multicollinearity we could simply drop these variables, but even then, stepwise selection suffers from well-known statistical problems (e.g. p-values inflated by repeated testing and unstable variable choices). What if there was a better way of performing variable selection?

Lasso Regression

Least Absolute Shrinkage and Selection Operator, or LASSO, is a regression method utilising \(\ell_1\)-regularisation, where the coefficients are estimated as:

\[\beta^{lasso}_\lambda = \underset{\beta}{\operatorname{\arg\min}} \Biggl\{ \underbrace{\sum_{i=1}^n\Biggl( y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\Biggr)^2}_{\text{Residual Sum of Squares}\; (RSS)}+\lambda\sum_{j=1}^p|\beta_j| \Biggr\}\]

LASSO is often used when we also need to perform variable selection, since for large enough values of \(\lambda\) it shrinks some coefficients exactly to zero. We can therefore use LASSO to perform variable selection and modelling at the same time. It also mitigates multicollinearity, since LASSO tends to shrink the coefficients of redundant, highly correlated variables to 0.
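This shrinkage behaviour can be visualised with a coefficient path plot: as \(\lambda\) grows, the coefficients are pulled towards zero and drop out of the model one by one (a sketch, using the same x and y objects as in the cross-validation code below):

```r
library(glmnet)

# Fit the lasso over a grid of lambda values and plot the coefficient paths;
# each curve is one coefficient shrinking towards zero as lambda increases
path_fit = glmnet(x, y, alpha = 1, standardize = TRUE)
plot(path_fit, xvar = "lambda", label = TRUE)
```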

Main idea behind LASSO

The main idea of LASSO is that we introduce a small amount of bias into the way we fit our model, and in return for that small amount of bias we get a significant drop in variance.

Recalling the LASSO regression formula, we need to choose a suitable value of the hyperparameter \(\lambda\), which we do through cross-validation (here, 10-fold CV).

Performing LASSO regression

set.seed(2002)

# alpha = 1 selects the lasso penalty; x (predictor matrix) and y (quality)
# were constructed earlier
cv_lasso = cv.glmnet(x, y, alpha = 1, standardize = TRUE, nfolds = 10)

plot(cv_lasso)

Hyperparameter choice from CV

cv_lasso$lambda.min
[1] 0.002537796

Selected coefficients for our Lasso model

coef(cv_lasso)  # defaults to s = "lambda.1se", which shrinks more heavily than lambda.min
12 x 1 sparse Matrix of class "dgCMatrix"
                               s1
(Intercept)           2.732099287
fixed.acidity        -0.039233911
volatile.acidity     -1.750671036
citric.acid           .          
residual.sugar        0.015682518
chlorides            -0.547985113
free.sulfur.dioxide   0.002585403
total.sulfur.dioxide  .          
density               .          
pH                    0.035647346
sulphates             0.202078813
alcohol               0.335047738

Performance Metrics on LASSO

best_model = glmnet(x, y, alpha = 1, lambda = cv_lasso$lambda.min)
y_predicted = predict(best_model, newx = x)

# Deviance-based approximations to AIC and BIC: tLL is the deviance
# explained relative to the null model, so these values are on a different
# scale from lm()'s AIC/BIC and are not directly comparable with them.
tLL = best_model$nulldev - deviance(best_model)
k = best_model$df    # number of non-zero coefficients
n = best_model$nobs  # number of observations

AIC = -tLL + 2*k
BIC = log(n)*k - tLL

cbind(evaluate(y, y_predicted, data), data.frame(AIC = AIC, BIC = BIC))
\[\begin{array}{c|ccccc} & \textrm{RMSE} & \textrm{MAE} & R^2 & \textrm{AIC}& \textrm{BIC}\\ \hline \textrm{Stepwise Select} & 0.751 & 0.584 & 0.282 & 11108.29 & 11173.25\\ \textrm{LASSO} & 0.751 & 0.584 & 0.281 & -1059.98 & -995.02\\ \end{array}\]

Both models achieve essentially the same RMSE, MAE and \(R^2\). Note that the LASSO AIC/BIC come from the deviance-based approximation, so they are on a different scale from the lm() values and should not be compared directly across the two rows.